Considering faulty behaviors: message loss, out-of-order delivery, component failures

General consideration of faults and fault recovery

To design systems that can deal with faults, one first has to consider which kinds of faults the system should handle. It is then important to determine how such faults can be detected and what exceptional error handling should be provided. In this context, one distinguishes between the following three actions:

It is important to distinguish different failure modes:

Fail-safe failures are much easier to detect, localize and recover from. A stand-by unit is sufficient: for example, if the tuning fork does not produce any sound, use the stand-by unit. The total number of units available (n) must be larger than the number of units that may fail (f): n > f.

A Byzantine failure can only be detected by comparing a component's result with the results produced by other components. This leads to the triple redundancy design often used in highly reliable systems: three identical components perform the task in parallel. At the end of an operation, a comparison unit compares the three results obtained; if they are not identical, one can identify the faulty component (under the assumption that only a single unit fails at a time). One then uses the result of the other two components (the system is fault-tolerant for a single fault) and tries to replace the faulty component as fast as possible (before the next component may fail).

We have n > f + 1 under the assumption that different failing units will not produce the same wrong result; in general we have n > 2*f. This assumes an absolutely reliable comparison unit. If there is no centralized comparison unit, as in a distributed system, we have n > 3*f.
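The comparison unit described above can be sketched as a simple majority voter. This is only an illustration under stated assumptions: the function name `vote` is hypothetical, results are assumed to be hashable values, and the voter itself is assumed to be perfectly reliable (the n > 2*f case).

```python
from collections import Counter


def vote(results):
    """Return (majority_value, indices_of_disagreeing_units).

    Raises ValueError when no value is reported by a strict majority,
    i.e. when more units failed than the redundancy can mask.
    """
    counts = Counter(results)
    value, count = counts.most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no strict majority; too many units failed")
    # Units whose result differs from the majority are the suspected faulty ones.
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty


# Three redundant units, the third one produces a wrong result.
value, faulty = vote([42, 42, 17])
```

With three units, one wrong result is outvoted and the faulty unit is identified by its index, so it can be replaced; if all three results differ, the voter refuses to pick a result.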

Here are some other important concepts:

Faults in distributed systems

One characteristic of distributed systems, as compared with parallel systems, is that one has to assume that some faults may occur within the system. The system therefore has to be designed to recover from such faults. For example, the purpose of the first communication protocols, such as the Alternating Bit Protocol (ABP), was to recover from faults in message transfer: messages delivered with transmission errors or messages lost completely. One has to deal with faults related to message transmission and with component failures (see below).
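To make the recovery idea concrete, here is a minimal simulation of an alternating-bit style transfer over a channel that loses messages. It is a sketch, not a full protocol implementation: the function name `abp_transfer`, the `loss_rate`, `seed` and `max_tries` parameters, and the modelling of a timeout as "try again in the next loop iteration" are all assumptions made for illustration.

```python
import random


def abp_transfer(messages, loss_rate=0.3, seed=0, max_tries=1000):
    """Simulate stop-and-wait transfer with a one-bit sequence number.

    Returns (delivered_messages, number_of_data_transmissions).
    """
    rng = random.Random(seed)   # seeded so that runs are reproducible
    delivered = []
    send_bit = 0                # sender state: bit attached to the current message
    expected_bit = 0            # receiver state: bit of the next new message
    transmissions = 0
    for msg in messages:
        acked = False
        while not acked:
            transmissions += 1
            if transmissions > max_tries:
                raise RuntimeError("channel too lossy, giving up")
            if rng.random() < loss_rate:
                continue        # data message lost: sender times out, retransmits
            if send_bit == expected_bit:
                delivered.append(msg)   # new message: deliver exactly once
                expected_bit ^= 1
            # A duplicate (wrong bit) is not delivered again, but is still acknowledged.
            if rng.random() < loss_rate:
                continue        # acknowledgement lost: sender times out, retransmits
            acked = True
        send_bit ^= 1           # alternate the bit for the next message
    return delivered, transmissions


delivered, transmissions = abp_transfer(["a", "b", "c"])
```

Whatever the loss pattern, the receiver delivers each message exactly once and in order; losses only increase the number of transmissions.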

Design strategies for faults related to message transmission

Issues with message delivery

Assuming that there are N system components that communicate with one another through the exchange of messages, the following situations may characterize the message transmission service from component A to component B (here we assume that transmission errors have been detected by redundancy parameters and lead to message loss). When designing a protocol for some application, one has to consider which of these cases applies and design the protocol accordingly.
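As one illustration of designing for a particular delivery service, the sketch below shows a receiver that tolerates out-of-order and duplicated delivery by using sequence numbers. The class name `InOrderReceiver` and the assumption that the sender numbers its messages 0, 1, 2, ... are hypothetical choices for this example; message loss would additionally require retransmission, which is not shown.

```python
class InOrderReceiver:
    """Buffers messages and releases them to the application in sequence order."""

    def __init__(self):
        self.next_seq = 0    # sequence number of the next message to release
        self.pending = {}    # messages that arrived early, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one message; return the payloads that can now be released."""
        if seq < self.next_seq or seq in self.pending:
            return []        # duplicate: already released or already buffered
        self.pending[seq] = payload
        released = []
        # Release the longest run of consecutive messages starting at next_seq.
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


receiver = InOrderReceiver()
first = receiver.receive(1, "b")    # arrives early, is buffered
second = receiver.receive(0, "a")   # fills the gap, releases both
```

If the underlying service already guarantees in-order delivery without duplicates, this buffering is unnecessary, which is why the properties of the transmission service have to be settled first.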

Component failures

Fault tolerance for Byzantine failures is very difficult (see for instance Wikipedia). For fail-safe failures, one usually uses an "are you alive" protocol: from time to time, a neighbor sends the message "are you alive", which should be answered by "yes". If the answer does not arrive, the neighbor assumes that the component has failed.
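The bookkeeping on the neighbor's side of such an "are you alive" protocol might look as follows. This is a minimal sketch: the class name `FailureDetector`, the `timeout` parameter, and passing the current time explicitly (instead of reading a clock and sending real messages) are assumptions made to keep the example self-contained.

```python
class FailureDetector:
    """Tracks the last "yes" reply of each monitored component."""

    def __init__(self, timeout):
        self.timeout = timeout   # maximum allowed silence, in seconds
        self.last_reply = {}     # component name -> time of its last "yes"

    def record_reply(self, component, now):
        """Call whenever a "yes" answer arrives from a component."""
        self.last_reply[component] = now

    def suspected(self, now):
        """Return the components whose last reply is older than the timeout."""
        return sorted(
            c for c, t in self.last_reply.items() if now - t > self.timeout
        )


detector = FailureDetector(timeout=3.0)
detector.record_reply("A", now=0.0)
detector.record_reply("B", now=0.0)
detector.record_reply("A", now=4.0)   # A keeps answering, B stays silent
failed = detector.suspected(now=5.0)
```

Note that a timeout cannot distinguish a failed component from a slow one or from a lost reply, so the choice of the timeout value trades detection speed against false suspicions.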

In the case of a two-party protocol, one usually does not consider component failures; if such a failure occurs, the whole system becomes non-operational. However, there are multi-party protocols that recover from single or multiple component failures, for instance in the case of (a) load-sharing protocols between several servers, (b) peer-to-peer systems, (c) distributed databases, etc. Note: some of the protocols proposed for the course projects are of this nature.


Created: October 30, 2014